
    Deep Learning for Period Classification of Historical Texts

    In this study, we address the task of classifying historical texts by their assumed period of writing. This task is useful in digital humanities studies, where many texts have unidentified publication dates. For years, the typical approach to temporal text classification was supervised machine learning. These algorithms require careful feature engineering and considerable domain expertise to design a feature extractor that transforms the raw text into a feature vector from which the classifier can learn to classify any unseen valid input. Recently, deep learning has produced extremely promising results for various tasks in natural language processing (NLP). The primary advantage of deep learning is that the feature layers are not designed by human engineers; instead, the features are learned from data with a general-purpose learning procedure. We investigated deep learning models for period classification of historical texts, comparing three common models: paragraph vectors, convolutional neural networks (CNN), and recurrent neural networks (RNN). We demonstrate that the CNN and RNN models outperformed both the paragraph vector model and supervised machine-learning algorithms. In addition, we constructed word embeddings for each time period and analyzed semantic changes in word meanings over time.
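As a minimal illustration of the supervised baseline the abstract contrasts with deep learning, the sketch below classifies a text by period with a hand-built bag-of-words Naive Bayes model. The periods and example sentences are invented for illustration; the paper's actual features, data, and classifiers are not shown here.

```python
# Toy supervised baseline for period classification: bag-of-words Naive Bayes
# with Laplace smoothing, built only from the Python standard library.
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (text, period). Returns per-period word counts and doc counts."""
    counts = defaultdict(Counter)
    totals = Counter()
    for text, period in docs:
        counts[period].update(text.lower().split())
        totals[period] += 1
    return counts, totals

def predict(counts, totals, text):
    """Return the period with the highest smoothed log-probability."""
    vocab = {w for c in counts.values() for w in c}
    n_docs = sum(totals.values())
    best, best_lp = None, float("-inf")
    for period, c in counts.items():
        lp = math.log(totals[period] / n_docs)          # class prior
        denom = sum(c.values()) + len(vocab)            # Laplace denominator
        for w in text.lower().split():
            lp += math.log((c[w] + 1) / denom)          # add-one smoothing
        if lp > best_lp:
            best, best_lp = period, lp
    return best

# Invented mini-corpus: archaic vs. modern vocabulary.
docs = [
    ("thou art most welcome", "1600s"),
    ("thee and thy kin", "1600s"),
    ("the train arrived on time", "1900s"),
    ("the factory and the train", "1900s"),
]
counts, totals = train(docs)
print(predict(counts, totals, "thou and thee"))  # → 1600s
```

Such a model depends entirely on the engineered feature representation (here, raw word counts), which is exactly the dependence the abstract's deep learning models avoid.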

    Processing Multi-Word Discourse Markers in Translation: English to Hebrew and Lithuanian

    Purpose: Multi-word expressions have been shown to be of key importance in language generation and processing. They can also serve a discourse-organizing function, and certain multi-word expressions operate as discourse markers. The purpose of the current research is to examine multi-word expressions used as discourse markers in English TED talk transcripts and compare them with their counterparts in their Lithuanian and Hebrew translations, determining whether English multi-word expressions used as discourse markers in social media texts remain multi-word expressions in Lithuanian and Hebrew translation, and searching for the reasons behind changes of discourse markers in translation. We address the research question of how English multi-word discourse markers are processed in Hebrew and Lithuanian translation.

    Speaker Attitudes Detection through Discourse Markers Analysis

    Speaker attitude detection is important for processing opinionated text. Survey data provide a valuable source of information and research material for different scientific disciplines. They are also of interest to practitioners such as policymakers, politicians, government bodies, educators, journalists, and other stakeholders whose occupations relate to people and society. Survey data offer evidence about particular language phenomena and public attitudes, giving a broader picture of the clusters of social attitudes. In this regard, attitudinal discourse markers play a central role, in the sense that they are pointers to the speaker's attitudes.

    Building an OWL Ontology for Representing, Linking, and Querying SemAF Discourse Markers

    Discourse markers are linguistic cues that indicate how an utterance relates to the discourse context and what role it plays in conversation. Linguistic Linked Open Data (LLOD) are emerging technologies that provide a powerful instrument for representing and interpreting language phenomena on a web scale. The main objective of this paper is to demonstrate how LLOD technologies can be applied to represent and annotate a corpus of multiword discourse markers, and what the effects of this are. In particular, our aim is to apply semantic web standards such as RDF and the Web Ontology Language (OWL) for publishing and integrating data. We present a novel scheme for discourse annotation that combines the ISO standards describing discourse relations and dialogue acts, ISO DR-Core (ISO 24617-8) and ISO Dialogue Acts (ISO 24617-2), in nine languages (cf. Silvano and Damova 2022; Silvano et al. 2022). We develop an OWL ontology to formalize that scheme, provide a newly annotated dataset, and link its RDF edition with the ontology. We then describe the conjoint querying of the ontology and the annotations by means of SPARQL, the standard query language for the web of data. The ultimate result is that we are able to perform queries over multiple, interlinked datasets with complex internal structure, without the need for any specialized software: off-the-shelf technologies built on web standards can be carried over effortlessly to different operating systems, databases, and programming languages. This is a first but essential step in developing novel, powerful, and broadly accessible means for the corpus-based study of multilingual discourse, communication analysis, and attitude discovery.
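To give a sense of the conjoint querying of ontology and annotations described in the abstract, a SPARQL query of roughly the following shape could retrieve annotated markers together with their discourse relations. The namespace, property names, and class names below are hypothetical placeholders, not the project's actual vocabulary.

```sparql
# Hypothetical query: multiword discourse markers, the DR-Core relation they
# mark, and the language of each annotation, joined against the ontology.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dm:   <http://example.org/discourse-markers#>   # placeholder namespace

SELECT ?marker ?relation ?language
WHERE {
  ?annotation dm:surfaceForm   ?marker ;
              dm:marksRelation ?relation ;
              dm:language      ?language .
  ?relation rdfs:subClassOf dm:DiscourseRelation .     # ontology-side constraint
}
```

Because both the annotations and the ontology are RDF, the same query engine evaluates the data-side triple patterns and the ontology-side class hierarchy together, which is what makes conjoint querying possible without specialized software.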

    LOD-Connected Offensive Language Ontology and Tagset Enrichment

    The main focus of the paper is the definitional revision and enrichment of an offensive language typology, with reference to publicly available offensive language datasets, tested on available pretrained lexical embedding systems. We review over 60 available corpora and compare the tagging schemas applied there, while attempting to explain semantic differences between particular concepts of the category OFFENSIVE in English. We present a finite set of classes covering aspects of offensive language representation, along with linguistically sound explanations, based on the categories originally proposed by Zampieri et al. [1, 2] for offensive language categorization, and tested by means of Sketch Engine tools on a large web-based corpus. The schemata are juxtaposed and discussed with reference to the non-contextual word embeddings FastText, Word2Vec, and GloVe. We also provide a methodology for mapping existing corpora to the unified ontology. The proposed schema will enable further comparable research and effective use of corpora of languages other than English. It will also be applied in building an enriched tagset to be trained and used on new data, with the application of recently developed LLOD techniques [3].
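A common way to relate category labels to non-contextual embeddings such as FastText, Word2Vec, or GloVe is cosine similarity between word vectors. The sketch below shows the computation with invented three-dimensional vectors; real embeddings would be loaded from pretrained files and have hundreds of dimensions.

```python
# Cosine similarity between (toy) word vectors, standing in for the kind of
# comparison one might run against pretrained non-contextual embeddings.
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Invented vectors: two offense-related terms and one unrelated term.
emb = {
    "offensive": [0.9, 0.1, 0.2],
    "abusive":   [0.8, 0.2, 0.1],
    "table":     [0.1, 0.9, 0.7],
}
print(cosine(emb["offensive"], emb["abusive"]))  # high: related terms
print(cosine(emb["offensive"], emb["table"]))    # low: unrelated terms
```

With real embeddings, this kind of score lets one check whether terms grouped under one schema category actually cluster together in the embedding space.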

    Implicit Offensive Language Taxonomy and Its Application for Automatic Extraction and Ontology

    Purpose: In this study, we explore varying forms of implicit (mostly figurative) offensiveness (e.g., irony, metaphor, and hyperbole) in order to propose a linguistic taxonomy of implicit offensiveness (and of how it permeates explicit forms), together with an ontology of offensive terms readily applicable to fine-tuned, pre-trained language models (word and phrase embeddings). Offensive language has recently attracted great attention from computational scientists (e.g., Zampieri et al., 2019) and linguists alike (e.g., Haugh & Sinkeviciute, 2019). While NLP scholars focus on ways of automatically extracting what is generally and most often referred to as toxic language, linguistics frequently explores the concept of hate speech. Implicit offensive language, however, as opposed to explicit offence, has received little scholarly attention, and the attention it has received has so far focused solely on single, unrelated concepts and terms. This paper aims to propose an overarching model in which the varying subtypes of implicitness used in the context of offensive language are conceptually linked (Bączkowska et al., 2022).
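A taxonomy of the kind proposed can be encoded as a simple hierarchy for downstream lookup. The branch and leaf names below are illustrative, drawn only from the devices the abstract lists (irony, metaphor, hyperbole), not the authors' final taxonomy.

```python
# A toy encoding of an offensiveness taxonomy as a nested mapping, with a
# helper that returns the path from the root to a given subtype.
TAXONOMY = {
    "offensive": {
        "explicit": ["slur", "insult"],                 # illustrative leaves
        "implicit": ["irony", "metaphor", "hyperbole"],  # devices from the abstract
    }
}

def subtype_path(label):
    """Return [root, branch, label] for a known subtype, else None."""
    for top, branches in TAXONOMY.items():
        for branch, leaves in branches.items():
            if label in leaves:
                return [top, branch, label]
    return None

print(subtype_path("irony"))  # → ['offensive', 'implicit', 'irony']
```

A machine-readable hierarchy like this is the kind of structure that can then be formalized as an ontology and paired with embedding-based models.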

    An OWL ontology for ISO-based discourse marker annotation

    Purpose: Discourse markers are linguistic cues that indicate how an utterance relates to the discourse context and what role it plays in conversation. We are preparing an annotated corpus in nine languages, and specifically aim to explore the role of Linguistic Linked Open Data (LLOD) technologies in the process, i.e., the application of web standards such as RDF and the Web Ontology Language (OWL) for publishing and integrating data. We demonstrate the advantages of this approach.

    ISO-based annotated multilingual parallel corpus for discourse markers

    Discourse markers carry information about discourse structure and organization, and also signal local dependencies or the epistemological stance of the speaker. They provide instructions on how to interpret the discourse, and their study is paramount to understanding the mechanisms underlying discourse organization. This paper presents a new language resource: an ISO-based annotated multilingual parallel corpus for discourse markers. The corpus comprises nine languages: Bulgarian, Lithuanian, German, European Portuguese, Hebrew, Romanian, Polish, and Macedonian, with English as a pivot language. To represent the meaning of the discourse markers, we propose an annotation scheme of discourse relations from ISO 24617-8 with a plug-in to ISO 24617-2 for communicative functions. We describe an experiment in which we applied the annotation scheme to assess its validity. The results reveal that, although some extensions are required to cover all the multilingual data, the scheme provides a proper representation of discourse markers' values. Additionally, we report some relevant contrastive phenomena concerning the interpretation and role of discourse markers in discourse. This first step will allow us to develop deep learning methods to identify and extract discourse relations and communicative functions, and to represent that information as Linguistic Linked Open Data (LLOD).
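One annotated instance under the scheme described could be modeled as a record pairing an ISO 24617-8 discourse relation with an optional ISO 24617-2 communicative function as the plug-in. The class and field names below are an illustrative sketch, not part of either standard or of the corpus's actual format.

```python
# A minimal record type for one discourse-marker annotation: a surface form in
# one of the corpus languages, its discourse relation, and an optional
# communicative function supplied by the ISO 24617-2 plug-in.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DiscourseMarkerAnnotation:
    surface_form: str                   # e.g. "on the other hand"
    language: str                       # one of the nine corpus languages
    relation: str                       # ISO 24617-8 relation label
    communicative_function: Optional[str] = None  # ISO 24617-2 label, if any

# Example instance; the relation label "Contrast" is used for illustration.
ann = DiscourseMarkerAnnotation(
    surface_form="on the other hand",
    language="en",
    relation="Contrast",
)
print(ann.relation)
```

Making the communicative function optional mirrors the plug-in design: every marker gets a discourse relation, while the dialogue-act layer is attached only where it applies.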

    Validation of language agnostic models for discourse marker detection

    Using language models to detect or predict the presence of language phenomena in text has become a mainstream research topic. With the rise of generative models, experiments using deep learning and transformer models have attracted intense interest. Aspects such as the precision of predictions, portability to other languages or phenomena, and scale have been central to the research community. Discourse markers, as a language phenomenon, perform important functions, such as signposting, signalling, and rephrasing, thereby facilitating discourse organization. Our paper is about discourse marker detection, a complex task because it concerns a phenomenon manifested by expressions that occur as content words in some contexts and as discourse markers in others. We adopted a language-agnostic model trained on English to predict the presence of discourse markers in texts in eight other languages unseen by the model, with the goal of evaluating how well the model performs on languages with different structural and lexical properties. We report on the process of evaluating and validating the model's performance across European Portuguese, Hebrew, German, Polish, Romanian, Bulgarian, Macedonian, and Lithuanian, and on the results of this validation. This research is a key step towards multilingual language processing.
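A per-language validation of the kind described amounts to comparing the model's binary predictions (marker present or not) against gold labels and reporting a score per language. The sketch below computes F1 from invented labels; the actual model, data, and scores are not reproduced here.

```python
# Per-language F1 for binary discourse-marker detection, using toy gold labels
# and predictions (1 = discourse marker present, 0 = absent).
def f1(gold, pred):
    """F1 score for binary labels; returns 0.0 when there are no true positives."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

# Invented gold/prediction pairs for two of the evaluation languages.
results = {
    "pt": f1([1, 1, 0, 0], [1, 0, 0, 0]),  # one miss → F1 = 2/3
    "he": f1([1, 0, 1, 0], [1, 0, 1, 1]),  # one false alarm → F1 = 0.8
}
print(results)
```

Reporting the score language by language, rather than pooled, is what reveals whether performance degrades on languages whose structure differs most from the English training data.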